Abstract

Audiovisual speech stimuli have been shown to produce a variety of perceptual phenomena. Enhanced detectability of 
acoustic speech in noise, when the talker can also be seen, is one of those phenomena. This study investigated whether this 
enhancement effect is specific to visual speech stimuli or can rely on more generic non-speech visual stimulus properties. 
Speech detection thresholds for an auditory /ba/ stimulus were obtained in a white noise masker. The auditory /ba/ was 
presented adaptively to obtain its 79.4% detection threshold under five conditions. In Experiment 1, the syllable was pre- 
sented (1) auditory-only (AO) and (2) as audiovisual speech (AVS), using the original video recording. Three types of 
synthetic visual stimuli were also paired synchronously with the audio token: (3) A dynamic Lissajous (AVL) figure whose 
vertical extent was correlated with the acoustic speech envelope; (4) a dynamic rectangle (AVR) whose horizontal extent 
was correlated with the speech envelope; and (5) a static rectangle (AVSR) whose onset and offset were synchronous with 
the acoustic speech onset and offset. Ten adults with normal hearing and vision participated. The results, in terms of dB 
signal-to-noise ratio (SNR), were AVS < (AVL ≈ AVR ≈ AVSR) < AO. That is, AVS was significantly easiest to detect, 
there was no difference among the synthesized visual stimuli, and all audiovisual conditions resulted in significantly lower 
thresholds than AO. To determine the advantage of the AVS stimulus, in Experiment 2, a preliminary mouth gesture was 
edited from the video speech token. This manipulation defeated the advantage for both the original and the edited AVS 
stimulus, while the audiovisual detection enhancement persisted. Overall, the results showed enhanced auditory speech 
detection with visual stimuli but no advantage for a fine-grained correlation between acoustic and optical speech signals. 
© 2004 Elsevier B.V. All rights reserved. 
1. Introduction

Audiovisual speech stimuli produce a diverse 
set of perceptual phenomena. For example, under 
low noise and good audibility conditions, phenom- 
ena such as the McGurk effect (McGurk and Mac- 
Donald, 1976) and the ventriloquist effect (De 
Gelder and Bertelson, 2003) show that vision influ- 
ences auditory perception. Under good listening 
conditions, being able both to hear and see the 
talker can also enhance comprehension of the mes- 
sage (Arnold and Hill, 2001; Reisberg et al., 1987). 
Under noisy conditions as well, intelligibility is en- 
hanced when the talker can be seen: Seeing the 
talker can be functionally equivalent to an increase 
in the acoustic signal-to-noise ratio (Sumby and 
Pollack, 1954). Recently, speech detection in noise 
has been shown to be enhanced under audiovisual 
conditions: Grant (2001) and Grant and Seitz 
(2000) showed that a spoken sentence masked by 
acoustic white noise is detectable at a lower sig- 
nal-to-noise ratio (SNR) when the talker's speech 
movements can be seen. 

In Grant's experiments, sentences were pre- 
sented in white acoustic noise, in a two-interval 
forced-choice adaptive paradigm. Participants 
were asked to listen during both intervals and de- 
tect the acoustic stimulus sentence, which was pre- 
sented in only one of the intervals. The sentences 
were presented under auditory-only (AO) and 
audiovisual (AV) conditions. The mean improvement
(AV − AO threshold) in the detection
threshold was 1.6 dB SNR (0.8-2.2 dB SNR)
(Grant and Seitz, 2000). Similar results were ob- 
tained in (Grant, 2001). In the former study, a con- 
trol experiment examined whether reading the text 
of each sentence prior to a detection trial also en- 
hanced the detection threshold. A mean improve- 
ment of 0.5 dB SNR (0.33-0.78), which was 
statistically significant, was obtained when the sen- 
tences were read in advance. However, visual 
speech was significantly more effective for enhanc- 
ing the threshold than was reading. The reading ef- 
fect was attributed to a reduction in stimulus 
uncertainty. 

In order to explain the AV detection enhance- 
ment effect, Grant calculated correlations between 
speech amplitude and the area of the mouth opening.
The rationale for undertaking these correlations
came from studies showing systematic
relationships between the cross-sectional area of 
the front cavity of the vocal tract and the acoustic 
speech amplitude (Stevens, 1998). Pearson correla- 
tions for RMS energy of stimulus sentences versus 
area of mouth opening were in the range of 0.35- 
0.52 (Grant and Seitz, 2000), or 12-27% variance 
accounted for. Somewhat higher correlations were
obtained when the analysis used speech that was
bandpass filtered in the region of the second
formant, but the improvements were not
statistically significant.
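
The conversion from a Pearson correlation to "variance accounted for" quoted above is simply the squared correlation. The following minimal Python sketch reproduces that arithmetic for the reported range; the function name is illustrative and not taken from the cited studies.

```python
def variance_accounted_for(r: float) -> float:
    """Return the percentage of variance accounted for by a correlation r."""
    return 100.0 * r * r

# Correlation range reported by Grant and Seitz (2000), as quoted in the text.
low_r, high_r = 0.35, 0.52

for r in (low_r, high_r):
    print(f"r = {r:.2f} -> {variance_accounted_for(r):.1f}% of variance")
```

Rounded to whole percentages, the two values correspond to the 12-27% range given in the text.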

Grant (2001) reported higher average local cor- 
relations of 0.82, or 67% variance accounted for, 
when the analysis was focused on restricted por- 
tions of the stimulus with the highest amplitudes 
and the acoustic signal was restricted to the region 
of the second formant. These high local correlations
were put forward as the driver of the audiovisual
speech detection enhancement effect.
Grant (2001) and Grant and Seitz (2000) sug- 
gested that the primary mechanism of the AV 
detection enhancement depends on perception of 
brief high positive correlations between lip area 
and acoustic amplitude peaks. By observing the 
visual stimulus, the perceiver was theorized to be- 
come "alerted to temporal, and possibly spectral, 
locations of the acoustic noise-plus-speech stimu- 
lus where the S/N [SNR] is most favorable for 
detecting the speech. By this account, visually 
congruent speech information may serve to direct 
auditory attention, thereby reducing temporal and 
spectral uncertainty" (Grant and Seitz, 2000, 
p. 1206). 

A problem with this explanation is that the time 
course of neural processing varies across stimulus 
attributes within sensory-perceptual systems and 
across sensory-perceptual systems, even leading 
under certain conditions to stimulus features being 
erroneously bound together (Moutoussis and 
Zeki, 1997; Treisman, 1996; von der Malsburg, 
1995; Zeki, 1998). Yet, as Grant points out (Grant, 
2001; Grant and Seitz, 2000), the detection of 
speech in noise would have to rely on brief por- 
tions of speech whose amplitude briefly exceeds 
the noise background. If visual processing of 
mouth gestures were to direct auditory attention, 
visual processing would have to be fast enough to
extract the gesture while the auditory system still 
has access to the brief acoustic peak. 

Many temporal constraints of the central ner- 
vous system are known. Neural processing times 
through the auditory and visual pathways from 
the periphery to the first several levels of the cortex 
(Mesulam, 1998) set constraints on when and 
where auditory and visual speech information 
could possibly interact neurophysiologically (Sch- 
roeder and Foxe, 2002). 

The first volley of stimulus driven activity into 
the auditory core cortex (the entry point to cortex 
for auditory information) occurs around 11-20 ms
post-stimulus onset (Steinschneider et al., 1999;
Yvert et al., 2001), and the conscious auditory
speech percept appears to develop within 150-200 ms
post-stimulus onset (Näätänen, 2001). In
comparison, intra-cortical recordings in V1/V2
(the entry point to cortex for visual information) 
have shown the earliest stimulus-driven response 
to be at approximately 56-60 ms (Foxe and Simp- 
son, 2002; Krolak-Salmon et al., 2001). Trans-cor- 
tical processing — processing required to extract 
stimulus features — requires time (Schroeder and 
Foxe, 2002). Evidence suggests that the latency 
for combining visual form and motion at the level 
of the cortex is at least 100 ms, and face motion 
processing might require latencies closer to 
170 ms (Puce and Perrett, 2003). These estimates 
of processing times suggest that by the time that 
a visual mouth gesture has been processed, the 
acoustic stimulus is likely buried in the noise back- 
ground again. At a cortical level, the temporal 
dynamics of auditory and visual perceptual stimu- 
lus processing do not seem well suited to using 
brief fine-grained correlations for detection. 

An alternate explanation for the AV speech 
detection enhancement effect — one that is not 
dependent on perceiving complex visual speech 
features, such as mouth area, but would require 
merely the co-presentation of an auditory and 
visual stimulus — is excitatory-excitatory conver- 
gence, such as the type demonstrated by multisen- 
sory neurons in the superior colliculus (Meredith, 
2002; Stein and Meredith, 1993). The superior col- 
liculus is a sub-cortical structure in the bottom-up 
pathway, prior to the higher cortical levels of stimulus
feature analysis, and is concerned with the
detection of events in extra-personal space (Mere- 
dith, 2002; Stein and Meredith, 1993). Superior 
colliculus neurons can respond weakly to AO or 
visual-only stimulation but very strongly to their 
combination, frequently super-additively at
threshold levels. Their responses are sensitive to 
the temporal relationship of multisensory stimuli, 
with responses greatest when the stimuli occur 
within 100ms of each other (Meredith et al., 
1987). Rather than relying on speech feature pro- 
cessing, the AV speech detection effect could rely 
on early sub-cortical processing that is not special- 
ized for speech (Bernstein et al., 2004) and does 
not require top-down attention. Other AV phe- 
nomena listed earlier in this introduction are also 
not all specific to speech, and some effects might 
engage more than one perceptual mechanism. 
For example, the ventriloquist effect, which in- 
volves mislocation of an auditory stimulus to that 
of a visual stimulus, can be demonstrated with 
both speech and non-speech stimuli (De Gelder 
and Bertelson, 2003) and has been attributed to 
early bottom-up processing (Colin et al., 2002).

1.1. The current study 

With the above considerations in mind, the cur- 
rent perceptual study was undertaken to investi- 
gate the stimulus conditions that lead to the AV 
speech enhancement effect. The study was designed 
to test whether enhanced auditory speech detec- 
tion depends on seeing a speech stimulus, or 
whether a simple, non-speech visual stimulus is 
sufficient. The study was designed also to test 
whether the effect relies on processing a fine- 
grained correlation between the area of a dynamic 
visual stimulus and the acoustic amplitude enve- 
lope, or whether merely presenting a constant 
visual stimulus during a speech token is sufficient 
to achieve enhanced detection. 

The study used an adaptive two-interval forced- 
choice paradigm (Levitt, 1971) to obtain detection 
thresholds for an acoustic speech token /ba/ whose 
level was fixed across trials. The syllable was pre- 
sented in an adaptively adjusted white noise mas- 
ker, where the noise sample was randomly 
selected from trial to trial. The acoustic token 
was presented in only one of the two intervals
of each trial, and participants were instructed 
to select the interval in which the syllable 
occurred. 

In Experiment 1, in addition to the AO speech 
token, four different types of visual stimuli were 
used in separate adaptive threshold runs. One of 
the visual stimuli was the face of the talker, re- 
corded at the same time as the recording of the 
audio stimulus. To test whether merely presenting 
a simple visual stimulus during the speech syllable 
could enhance the detection threshold for the /ba/ 
stimulus, a static filled rectangular shape was syn- 
thesized, whose presentation duration was equal to 
that of the audio /ba/, and therefore, had zero cor- 
relation with the acoustic envelope during the 
course of the /ba/ stimulus. A significant effect of 
this stimulus would be consistent with a low-level 
excitatory-excitatory interaction mechanism 
(Meredith, 2002; Stein and Meredith, 1993). 

To test whether the effective stimulus required a 
fine-grained correlation between the amplitude 
envelope of the speech and the area of the visual 
stimulus, related to the mechanism hypothesized 
by Grant, a dynamic rectangle whose horizontal 
extent was correlated with the speech amplitude 
envelope of the /ba/ was generated. The dynamic 
rectangle expanded horizontally, so that it would 
not have the appearance of a mouth opening, 
although its area was correlated with speech energy.
A fourth visual stimulus was generated to
capture the same dynamics as the rectangle but
with the potential to appear mouth-like. It
was a filled dynamic Lissajous figure, that is, a 
filled oval shape whose vertical extent was corre- 
lated with the acoustic speech amplitude. Thus, it 
presented an audio-to-visual correlation that 
might have a mouth-like appearance to partici- 
pants; however, if the Lissajous figure were pre- 
sented by itself, it would not convey phonetic 
information. The Lissajous figure tested the possi- 
bility that a very schematic mouth-like gesture, in 
combination with the audio /ba/, could create a 
speech impression that might result in a speech- 
specific effect. That is, a possible outcome would 
be that a similar enhancement would be achieved 
with the Lissajous figure as with the natural video 
token. 



During the AO speech detection threshold runs, 
a fixation cross was presented during each obser- 
vation interval, so as to reduce uncertainty within 
the context of each trial. That is, the fixation cross 
indicated the time periods during which partici- 
pants should attend for the auditory stimulus. 

In summary, if merely presenting
a simple non-speech visual stimulus with
an acoustic speech syllable is sufficient to enhance 
detection thresholds, the static rectangle should 
significantly reduce audio /ba/ detection thresh- 
olds. This result would be consistent with a bot- 
tom-up, excitatory-excitatory mechanism that 
did not require higher-level perception. If, how- 
ever, dynamic properties are required, then the dy- 
namic rectangle should be significantly more 
effective than the static rectangle, implicating high- 
er-level processing of visual stimulus properties. If 
the effect were specific to visual speech, then the 
natural video token should result in the lowest 
detection thresholds. If the dynamic Lissajous 
figure produced thresholds similar to the natural 
video token and lower than the other video stim- 
uli, the implication would be that stimuli need only 
be grossly speech-like and need not convey specific 
phonetic information. Following on the finding 
that there did seem to be a special advantage to 
only the natural visual speech token, a second 
experiment was run to investigate that effect 
further. 



2. Experiment 1 methods 

2.1. Participants 

Eleven participants were recruited, and 10 com- 
pleted the experiment. All were native speakers of 
American English (three males and seven females, 
ages 19-40 years, mean age 26.4 years), with normal
hearing (hearing thresholds ≤15 dB HL at
audiometric test frequencies from 250 to 
8000 Hz) (American National Standards Institute, 
1989). Their speech reception thresholds in noise 
were tested using the Hearing in Noise Test 
(HINT) (Nilsson et al., 1994). Their composite 
HINT scores (comprising measures of noise front, 
noise right, noise left) were normal. Participants 
were screened for normal vision, and they were
screened to be average or better lipreaders, as ref- 
erenced to the distribution of performance of a lar- 
ger group of hearing lipreaders (Bernstein et al., 
2000). Screening was used to assure that partici- 
pants were individuals who were likely to be users 
of visual speech information. They gave their in- 
formed consent, and they were paid $10 per hour 
for their participation. 

2.2. Stimuli 

The stimuli for this study were based on a single 
videorecorded /ba/ token. The token was produced 
by an experienced female talker as part of a much 
larger database of syllables. A UVW-1800 SONY 
Betacam SP recorder and a SONY production 
camera were used to make the recording. The nat- 
ural audiovisual speech (AVS) token was used in 
one of the conditions. 

Synthesized video stimuli were generated fol- 
lowing computation of the amplitude envelope of 
the acoustic /ba/ signal. Fig. 1 shows the amplitude 
envelope curve of the acoustic /ba/ token. Fig. 1 
also shows the area of the mouth opening, which 
was obtained by manually selecting the pixels 
within the mouth opening for each video field 
and computing the total for each field. Fig. 1 demonstrates
that the amplitude of the acoustic signal
rose sharply and slightly prior to the full mouth
opening. The amplitude peak was extremely brief.
Fig. 1 shows that in the natural video token, the
talker also slightly opened and closed her lips in
advance of the acoustic bilabial release of the /b/.
The dynamic synthesized video stimuli were corre- 
lated with the amplitude curve rather than the 
mouth opening area function. The full amplitude 
of the synthesized dynamic stimuli was achieved 
approximately two video frames (67 ms) ahead of 
the open mouth position. The correlation between 
synthesized motion stimuli and the acoustic enve- 
lope was 0.996. The correlation between the natu- 
ral mouth opening and the acoustic envelope was 
0.76, and between the natural mouth opening 
and the synthetic stimuli was 0.77. 

The dynamic Lissajous figure was synthesized 
at the field rate of the video speech, that is, 59.94
fields/s. Its vertical extent was correlated with
the speech amplitude envelope (see Figs. 1 and 
2). A dynamic rectangle was synthesized in a sim- 
ilar manner, but its horizontal extent was corre- 
lated with the speech amplitude envelope (see 
Fig. 2). The areas of the dynamic Lissajous and 
rectangle stimuli were equated so that neither 
had an energy advantage. A static rectangle corresponding
to the largest rectangular video frame
was also synthesized with temporal duration
equivalent to the acoustic syllable duration. During
the AO condition, a fixation cross appeared
on the video monitor during each stimulus observation
interval in the two-interval forced-choice
trial. All video stimuli were presented on a SONY
Trinitron monitor.

Fig. 2. Four video frames from each of the conditions with dynamic video stimuli. The top row is from the natural AVS condition. The
second row is from the dynamic Lissajous condition (AVL), and the third row is from the dynamic rectangle condition (AVR).

The presentation amplitude used for the audi- 
tory /ba/ did not vary throughout the experiment. 
It was set following a pilot experiment with five 
participants. In the pilot study, the speech was pre- 
sented at 65 dB SPL. But this resulted in unaccept- 
ably high noise levels whenever the speech was 
presented under audiovisual conditions. There- 
fore, the speech level for the current experiment 
was set at 60 dB SPL. 

Computer-generated white masking noise was 
stored in a long audio file. Each interval of mask- 
ing noise was selected at random from the long 
noise file and output through a sound card and a 
calibrated programmable attenuator. The speech 
and noise were mixed in real time and presented 
binaurally over TDH 49 headphones. 



All of the AV stimuli (natural video, synthe- 
sized video, and audio) were transferred to a 
DVD for use during the experiment. The compo- 
nent signal from the original Betacam SP video 
of the /ba/ stimulus was digitized on an ACCOM 
2XTREME real-time digital disk recorder. 
Uncompressed video frames were transferred to a 
PC as individual frame files with a spatial resolu- 
tion of 720 x 486. 

For every AV stimulus, a sequence of uncom- 
pressed frames for the video was built into an 
AVI (audio video interleave) file for software 
MPEG compression. All of the MPEG files for 
the different conditions were transferred to the 
DVD. MPEG Level 2 compression was accom- 
plished using the LIGOS LSX MPEG-Compressor 
(Version 3.5). We have obtained excellent results in 
direct comparisons between DVD and laserdisc, 
suggesting that this level of compression does not 
compromise video quality. The video input format 
was 720 x 480, interlaced with the top field first 
and frame rate of 29.97. For compression, the 
frame sequence was all I frames, with a constant 
bitrate specified at 7700 kbits/s. The bitrate was selected
so as not to exceed the peak rate allowed by
DVD when including uncompressed 48-kHz 
locked audio with the video. The MPEG files were 
authored to DVD using the ReelDVD (Version 
2.5.1) software package from SONIC. The result- 
ing DVD contained a single sequential program 
chain, which is required by the Panasonic V7400 
player to allow frame-based searching and access. 
By this method, random access of the stimuli for 
each trial was made possible. The audio /ba/ asso- 
ciated with the different visual stimuli was stored 
in a separate file, uncompressed with a sample rate 
of 48 kHz. As with the video, the audio associated 
with the different trial types was concatenated into 
a single long file for production of the DVD. The 
concatenation of the audio was performed using 
custom software that ensures frame-locked audio 
of 8008 audio samples/5 video frames. 
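
The figure of 8008 audio samples per 5 video frames follows from the 48-kHz sample rate and the video frame rate. The minimal Python sketch below checks that arithmetic with exact fractions. It assumes that the nominal 29.97 frames/s is the standard NTSC rate of 30000/1001 frames/s, which the text does not state explicitly.

```python
from fractions import Fraction

# Assumption: "29.97" denotes the NTSC frame rate of 30000/1001 frames/s.
FRAME_RATE = Fraction(30000, 1001)
SAMPLE_RATE = 48000  # audio samples per second, as stated in the text

# Audio samples spanned by one video frame (not a whole number).
samples_per_frame = SAMPLE_RATE / FRAME_RATE

# Five frames span a whole number of samples, which permits frame locking.
samples_per_5_frames = samples_per_frame * 5

# Duration of one video frame in milliseconds.
frame_ms = float(1000 / FRAME_RATE)

print(f"samples per frame:    {float(samples_per_frame)}")
print(f"samples per 5 frames: {samples_per_5_frames}")
print(f"frame duration:       {frame_ms:.2f} ms")
```

Under that assumption, one frame spans 1601.6 samples, so five frames is the shortest run of frames that contains a whole number of audio samples.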

2.3. Procedure 

There were five different conditions in the experiment:
(1) auditory-only speech (AO); (2) audiovisual
speech (AVS); (3) audio /ba/ with the dynamic
visual Lissajous figure (AVL); (4) audio /ba/ with a 
dynamic rectangle (AVR); and (5) audio /ba/ 
with a static rectangle (AVSR). Across all condi- 
tions in the experiment, the task was based on a 
two-interval, forced-choice adaptive threshold 
paradigm. In this paradigm, there were two obser- 
vation intervals for each trial, and the partici- 
pant indicated whether the acoustic speech token 
occurred in the first or second interval (see
Fig. 3). For each trial, the observation interval in
which the signal was presented was randomly selected.
The noise masker began before the first
interval in which a stimulus could occur and ended
after the second such interval. The onset trigger for
the noise masker was a signal recorded on the second
audio track of the DVD that stored the
stimuli.

Fig. 3. Trial structure. Each trial comprised two intervals. The
acoustic /ba/ stimulus could occur in only one of the intervals.
For trials with visual stimuli, the visual stimulus appeared in
both intervals. The dots in the figure correspond to the
temporal jitter in the timing of the two intervals (see text).

The adaptive rule used to adjust the masker noise
was as follows: after three consecutive correct
responses the noise level was increased, and after
one incorrect response it was decreased. This rule
converges on the 79.4% detection threshold
(Levitt, 1971). At the beginning of testing, the
acoustic signal was 6 dB below the level of the
noise (-6 dB SNR). The step sizes were adjusted
during adaptive testing so that, at the beginning,
step changes following the adaptive rule were 3 dB.
The step changes were then reduced to 2 dB until
the second and third reversals occurred, as specified
by the adaptive rules. Changes in the noise
step sizes then followed this schedule: 1 dB until
the fifth reversal started; 0.5 dB until the seventh
reversal started; 0.2 dB until the tenth reversal
started; and 0.1 dB for the final two reversals.
The threshold was the mean of the SNR levels at
all 12 reversal points.
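
The 79.4% convergence point follows from the rule itself: the track is in equilibrium when three correct responses in a row are as likely as not, that is, when p^3 = 0.5 (Levitt, 1971). The Python sketch below computes that value and simulates one adaptive run. It is an illustration only: the psychometric function, its midpoint and slope, and the mapping from reversal count to step size are assumptions (one possible reading of the schedule described above), not the software used in the experiment.

```python
import math
import random

# Equilibrium of a three-correct-then-harder rule: p**3 = 0.5.
TARGET = 0.5 ** (1.0 / 3.0)  # about 0.794


def step_size(n_reversals: int) -> float:
    """Step in dB given reversals so far (assumed reading of the schedule)."""
    if n_reversals < 1:
        return 3.0
    if n_reversals < 3:
        return 2.0
    if n_reversals < 5:
        return 1.0
    if n_reversals < 7:
        return 0.5
    if n_reversals < 10:
        return 0.2
    return 0.1


def p_correct(snr_db: float, midpoint: float = -17.0, slope: float = 1.0) -> float:
    """Hypothetical two-interval psychometric function, from 0.5 up to 1.0."""
    return 0.5 + 0.5 / (1.0 + math.exp(-(snr_db - midpoint) / slope))


def run_staircase(rng: random.Random, start_snr: float = -6.0) -> float:
    """Simulate one run and return the mean SNR over 12 reversal points."""
    snr = start_snr
    correct_in_row = 0
    last_direction = 0  # -1: SNR lowered (noise raised); +1: SNR raised
    reversals = []
    while len(reversals) < 12:
        direction = 0
        if rng.random() < p_correct(snr):
            correct_in_row += 1
            if correct_in_row == 3:  # three correct in a row: raise the noise
                correct_in_row = 0
                direction = -1
        else:  # a single error: lower the noise
            correct_in_row = 0
            direction = 1
        if direction != 0:
            if last_direction != 0 and direction != last_direction:
                reversals.append(snr)  # SNR at which the track turned around
            snr += direction * step_size(len(reversals))
            last_direction = direction
    return sum(reversals) / len(reversals)


threshold = run_staircase(random.Random(1))
print(f"target proportion correct: {TARGET:.3f}")
print(f"simulated threshold: {threshold:.2f} dB SNR")
```

Because the simulated listener is hypothetical, the resulting threshold has no bearing on the reported data; the sketch only shows how the rule, the shrinking step sizes, and the averaging of reversal points fit together.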

To prevent participants from relying on the tim- 
ing relationships within the two stimulus observa- 
tion intervals in each of the adaptive trials, a set of 
trials was generated for each condition that varied 
in terms of the onset of the stimulus within the 
observation interval (see Fig. 3). (Only the timing 
of the stimulus presentation was jittered, not the 
relationship between audio and video for AV stim- 
uli.) The total duration of the observation interval 
remained fixed at 64 frames. The stimulus onset jit- 
ter spanned six steps (0-5 frames, represented by 
the dots in Fig. 3), with each step equivalent to 
one video frame (at 33.37 ms/frame). The duration 
of onset jitter was randomly selected. In order to 
hold observation intervals constant across trials, 
the number of jitter steps preceding the stimulus
onset was subtracted from 6, and the remainder
was added following the stimulus.

The method that was used to obtain the re- 
quired timing relationships for each trial involved 
creating each in advance on the DVD. For each 
trial type, the sequence of uncompressed frames
for the video /ba/ or the synthesized visual stimulus 
was built into an AVI (Audio Video Interleave) file 
for software MPEG compression, along with the 
appropriate frames needed to jitter the onset times. 
During the AO trials, a fixation cross was pre- 
sented on the video monitor during each observa- 
tion interval. 

Participants were asked to respond as quickly 
and accurately as possible using a button box for 
which the first interval corresponded to the left 
button and the second interval corresponded to 
the right button. Response times were recorded 
by an external response time clock. On the first 
day of the experiment, participants received a set 
of practice trials in each stimulus condition to 
learn the procedure. During the practice, the same 
adaptive rules were used as during the testing, but 
the step size was maintained at 3dB. Also, during 
practice no more than 10 trials were presented in 
each condition. Each participant was tested in 
each of the conditions a total of eight times. The 
order of conditions was randomized within sets 
of the five conditions, so that each participant 
completed eight randomized sets. The number of 
days that participants required to complete the 
experiment varied between 2 and 6 days. The inter- 
val of time in which the sessions took place varied 
between 2 and 32 days. Ample rest times were 
given when the period of testing was reduced to 
2 days. Testing was administered in a double- 
walled IAC booth. 



3. Results 

3.1. Speech detection thresholds 

Fig. 4 shows the group mean thresholds across 
the five conditions and 10 participants. Examina- 
tion of the figure suggests that thresholds were 
highest in the AO condition, lowest in the AVS 
condition, and intermediate for the other AV con- 
ditions (AVSR, AVL, AVR). Fig. 5 shows the 
mean results for each participant and condition. 
This figure shows that the group pattern held gen- 
erally across participants but with some individual 
variations. 

Fig. 4. Group means for thresholds across all sessions and
participants. Means are dB signal-to-noise ratio at the estimated
79.4% threshold.

Fig. 5. Individual means for thresholds in each condition and
for each participant. Means are dB signal-to-noise ratio at the
estimated 79.4% threshold.

Analysis of variance was applied to the results 
in a repeated measures design. The repeated fac- 
tors were condition (AVS, AVL, AVR, AO, and 
AVSR), session (four), and run (two per session). 
There was a significant main effect of condition 
[F(4,6) = 50.49, p < 0.001]. The threshold means 
(in dB SNR) for the five conditions were 
AVS = -18.017; AO = -15.562; AVL = -16.823; 
AVR = -16.833, AVSR = -16.572. There were 
no other significant main effects or interactions. 
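
The size of the audiovisual benefit in each condition can be read off the reported means by subtracting each threshold from the AO threshold. The short Python sketch below does that subtraction; the variable names are illustrative, and the only data used are the five means given above.

```python
# Mean detection thresholds in dB SNR, as reported in the text.
thresholds_db = {
    "AO": -15.562,
    "AVS": -18.017,
    "AVL": -16.823,
    "AVR": -16.833,
    "AVSR": -16.572,
}

# Improvement relative to AO: positive values mean the signal was detected
# at a lower (less favorable) SNR than in the auditory-only condition.
improvement_db = {
    condition: round(thresholds_db["AO"] - value, 3)
    for condition, value in thresholds_db.items()
    if condition != "AO"
}

for condition, gain in sorted(improvement_db.items(), key=lambda kv: -kv[1]):
    print(f"{condition}: {gain:.3f} dB below the AO threshold")
```

By this arithmetic the natural video token lowers the threshold by about 2.5 dB, and each of the three synthetic stimuli by roughly 1.0 to 1.3 dB.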

Contrast analyses were used to test the specific 
hypotheses of the study. The results of the contrast 
analyses are shown in Table 1. AO thresholds were
significantly higher than the thresholds in the four 
other conditions (AVS, AVL, AVR, and AVSR). 
AVS thresholds were significantly lower than in 
each of the four other conditions. There were no
significant differences among AVL, AVR, and
AVSR conditions.

Table 1
Results of the contrast analyses for thresholds in the five conditions of Experiment 1

        AO            AVL             AVS           AVR             AVSR
AVL     F = 27.506
        p = 0.001
AVS     F = 108.791   F = 11.594
        p < 0.001     p = 0.008
AVR     F = 23.314    F = 0.002       F = 19.677
        p = 0.001     p = 0.961, *NS  p = 0.002
AVSR    F = 30.925    F = 0.969       F = 21.341    F = 0.962
        p < 0.001     p = 0.351, *NS  p = 0.001     p = 0.352, *NS

* NS: Not significant.

3.2. Response time measures 

Response times collected during the experiment 
were analyzed to determine whether there were 
patterns that could provide additional insight into 
the participants' performance. Several different 
analyses were performed, using responses to cor- 
rect trials only. If the participant responded before 
the audio began, even if the response was correct, 
that response time was not included in any latency 
analysis. Two measures of central tendency were 
computed, the arithmetic mean and the harmonic 
mean, because response times are known to have 
non-normal distributions and to be sensitive to 
outliers (Ratcliff, 1993). The harmonic mean is 
the reciprocal of the arithmetic mean of the recip- 
rocals of the scores. By using two measures, a bet- 
ter estimate of the stability of the results could be 
achieved. Both measures are reported only when 
the analyses produced different results. 
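
The definition of the harmonic mean given above can be written out directly. The Python sketch below compares a hand-written version with the standard library function and with the arithmetic mean. The response times are made-up values chosen to include one slow outlier; they are not data from the experiment.

```python
from statistics import harmonic_mean, mean


def harmonic_mean_manual(values):
    """Reciprocal of the arithmetic mean of the reciprocals of the values."""
    reciprocals = [1.0 / v for v in values]
    return 1.0 / mean(reciprocals)


# Hypothetical response times in ms, with one slow outlier.
response_times_ms = [1500.0, 1600.0, 1650.0, 4000.0]

arithmetic = mean(response_times_ms)
harmonic = harmonic_mean_manual(response_times_ms)

print(f"arithmetic mean: {arithmetic:.1f} ms")
print(f"harmonic mean:   {harmonic:.1f} ms")
print(f"library value:   {harmonic_mean(response_times_ms):.1f} ms")
```

The harmonic mean is pulled less toward the slow outlier than the arithmetic mean, which is why it is a useful second measure for skewed response time distributions.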

Response times were entered into a repeated 
measures ANOVA, with stimulus interval (first 
or second), session (4), run (2) and condition (5) 
as the repeated factors. This analysis used data 
from eight of the 10 participants, because there 
were a few missing cells for the other two partici- 
pants. For those participants, data for both 
intervals were not available for some sessions 
and/or conditions. Interval was the only significant 
effect for the mean response time measure 
[F(1,7) = 7.49, p = 0.029], with the second interval
faster (1593ms) than the first interval (1662ms). 



There were no other significant main effects or 
interactions. Mean response times for the five con- 
ditions were AO, 1677 ms; AVL, 1619 ms; AVS, 
1606ms; AVR, 1605 ms; and AVSR, 1632ms. 
(Note that these long response times were an arti- 
fact of measuring latency from the point corre- 
sponding to the onset of the visual speech 
stimulus.) The advantage to the second interval 
can be attributed to participants having quickly 
determined, particularly during suprathreshold tri- 
als, that the auditory stimulus was not in the first 
interval and could therefore be expected in the 
second. 

The analysis above did not take into account 
the effect of the varying SNRs across the individ- 
ual trials of the adaptive runs. Additional analyses 
were undertaken to determine whether response 
times varied across conditions in the neighborhood 
of the detection threshold. For each participant, 
the response times for trials that were within 
±0.5 dB SNR of the final threshold estimate were 
extracted from the data. Although stimulus obser- 
vation interval was a significant factor in the previ- 
ous analysis, it was not used in this analysis, 
because of empty cells for several of the partici- 
pants who had many errors whenever the stimuli 
were in the first interval. 
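
The extraction step described above can be illustrated with a minimal sketch. The trial values, the threshold, and the function name are hypothetical and are not taken from the study; the sketch only shows how trials within ±0.5 dB SNR of a run's threshold estimate might be selected and summarized with both the arithmetic and the harmonic mean.

```python
from statistics import harmonic_mean, mean

def near_threshold_rts(trials, threshold_db, window_db=0.5):
    """Return response times (ms) for trials whose SNR lies within
    +/- window_db of the run's final threshold estimate."""
    return [rt for snr, rt in trials if abs(snr - threshold_db) <= window_db]

# Hypothetical trials from one adaptive run: (SNR in dB, response time in ms).
trials = [(-14.0, 1500.0), (-16.0, 1600.0), (-16.5, 1700.0),
          (-17.0, 2400.0), (-18.0, 3000.0)]
threshold_db = -16.5  # hypothetical final threshold estimate for the run

rts = near_threshold_rts(trials, threshold_db)
print(rts)                 # response times retained for the analysis
print(mean(rts))           # arithmetic mean
print(harmonic_mean(rts))  # harmonic mean (reciprocal of the mean reciprocal)
```

Because the harmonic mean is computed from reciprocals, it is less affected by occasional very long response times than the arithmetic mean, which may be one reason the two measures can yield different results.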

A repeated measures ANOVA was carried out 
with condition (5) as the repeated factor. The anal- 
ysis of the mean response time measure failed to 
produce any significant effects. A significant main 
effect for condition was obtained with the har- 
monic means [F(4,6) = 5.161, p = 0.038]. Table 2 
shows the contrast analyses on the harmonic 
means. The significant main effect of condition
was due to the AO condition being slowest and significantly
different from the AVL and AVR conditions.
The AVS condition was not different in
latency from the AO condition.

Table 2
Results of the contrast analyses for response times estimated with the harmonic means in the five conditions in the experiment

       AO                          AVL                         AVS                         AVR                         AVSR
AVL    F = 6.939, p = 0.027
AVS    F = 1.322, p = 0.280, *NS   F = 0.144, p = 0.713, *NS
AVR    F = 15.931, p = 0.003       F = 0.000, p = 0.986, *NS   F = 0.140, p = 0.717, *NS
AVSR   F = 2.133, p = 0.178, *NS   F = 1.149, p = 0.312, *NS   F = 0.247, p = 0.631, *NS   F = 1.958, p = 0.195, *NS

These contrast analyses used harmonic mean RTs obtained across both stimulus observation intervals for responses within ±0.5 dB SNR of the estimated threshold for the run (N = 10).
*NS: Not significant.

Lastly, the response time measure was analyzed 
to determine whether the participants used differ- 
ent strategies across the two intervals as a function 
of condition. The response times in the first versus 
second interval for each condition and participant 
were entered into a repeated measures ANOVA. 
This analysis used an expanded range of SNR val- 
ues (within plus and minus 1.5 dB of the threshold 
for each run). All 10 participants contributed data. 
The main effect of interval was significant
[F(1,9) = 55.352, p < 0.001] (first interval 13% correct
responses; second interval 40% correct responses).
But the main effect of condition was
not reliable, nor was the interaction between inter- 
val and condition. Overall, response time measures 
suggest that participants did not change their re- 
sponse strategies as a function of condition. 

3.3. Discussion 

Experiment 1 was undertaken to investigate 
whether the AV speech detection enhancement ef- 
fect reported by Grant (2001) and Grant and Seitz 
(2000) is specific to visual speech stimuli, and 
whether it relies on perceptual analysis and atten- 
tion to the fine-grained correlated dynamics of 
an audiovisual speech stimulus. The threshold esti- 
mates obtained in Experiment 1 showed an 
unqualified advantage for audiovisual stimuli. 



But the results with the static rectangle showed 
that the static visual stimulus was sufficient to en- 
hance the detection thresholds of an acoustic 
speech syllable. Animating the visual stimuli as a 
function of the acoustic amplitude envelope of 
the /ba/ syllable did not result in further improve- 
ments to the threshold levels over the static rectan- 
gle. Participants appeared not to benefit from the 
correlation between the dynamics of the speech 
envelope and the dynamics of the Lissajous and 
rectangle shapes. The foregoing results are com- 
patible with the hypothesis that the AV speech 
detection enhancement effect does not require 
high-level analysis of visual mouth features. 

But significantly lower thresholds were obtained 
with the natural AVS stimulus. This result raised
the question of whether visual speech engages additional
or different mechanisms than do non-speech
visual stimuli, or whether the visual speech
token provided additional useful stimulus information
in the threshold task. Fig. 1 shows that
the natural visual speech movement began with a 
small lip opening followed by lip closure, prior 
to the acoustic bilabial release. The preliminary 
gesture was not in the synthesized visual stimuli, 
and the synthesized visual stimuli were shorter in 
duration, coinciding with the acoustic syllable 
amplitude envelope. The preliminary visible ges- 
ture of the natural stimulus was at a fixed duration 
from the amplitude peak in the acoustic stimulus 
and could have functioned as a cue to the peak's 
location. The relationship between the preliminary 
lip gesture and the acoustic syllable was fixed. The
preliminary information could have functioned as
a pre-cue and provided the advantage obtained
in the AVS condition.

Experiment 2 was undertaken to determine 
whether the advantage in the AVS condition was 
due to the preliminary mouth gesture in the natu- 
ral token. The visual token in the AVS condition 
was edited to remove the frames comprising the 
preliminary mouth opening and closing gesture. 
Following the same general methods as in Experi- 
ment 1, four conditions were compared in Experi- 
ment 2: AO, AVS, AVR, and AVSE (audiovisual 
speech edited). 



4. Experiment 2 methods 

4.1. Participants 

Four participants were recruited. Two had par- 
ticipated in Experiment 1, and two had partici- 
pated in pilot experiments. Thus, all had normal 
pure tone averages, normal or corrected-to-normal 
vision, passing or better scores on the lipreading 
screening, and normal HINT scores. Three were 
female. Their ages ranged between 20 and 28 years 
(mean age 25 years). They gave their informed con- 
sent, and they were paid $10 per hour for their 
participation. 

4.2. Stimuli 

The AVS, AVR, and AO were the same stimuli 
as in Experiment 1. The AVSE stimulus was the 
same as the AVS stimulus, except that the visual 
token began with the 25th frame (see Fig. 1, and 
note that the data are presented at the field rate). 
The durations of the AVS and AVSE visual stimuli
were the same, however, because the 25th frame
was repeated 25 times initially.
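
One way to read the editing step is sketched below. It assumes that every frame before the 25th was replaced by a copy of the 25th frame, so that the 25th frame is shown 25 times in total and the token length is unchanged. The 60-frame token length and the function name are hypothetical, and integers stand in for video frames.

```python
def make_edited_token(frames, first_kept=25):
    """Replace every frame before frame number `first_kept` (1-based) with a
    copy of that frame, so the edited token keeps the original duration."""
    hold = frames[first_kept - 1]
    return [hold] * (first_kept - 1) + frames[first_kept - 1:]

# Hypothetical 60-frame token; integers stand in for video frames.
original = list(range(1, 61))
edited = make_edited_token(original)

print(len(original) == len(edited))  # duration is preserved
print(edited[:26])                   # frame 25 held, then frame 26 onward
```

Under this reading, the preliminary opening and closing gesture is replaced by a still image of the mouth, while the timing of the remaining visible speech relative to the acoustic syllable is left untouched.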

4.3. Procedure 

The same procedure was followed as in Experi- 
ment 1, except that this experiment had four con- 
ditions (AVS, AVR, AO, and AVSE) rather than 
five. 



5. Results 

Fig. 6 shows the individual mean thresholds in 
Experiment 2 and the individual mean AVS 
threshold in a previous experiment (AVS-P). Mean 
thresholds from Experiment 2 are shown on the 
right of the figure. Examination of the figure sug- 
gests that thresholds were highest in the AO condi- 
tion and lower but similar in all the audiovisual 
conditions. Most striking is the high similarity be- 
tween AVS and AVSE thresholds within individ- 
ual participants. 

A repeated measures analysis of variance was 
applied to the results with the factors of condition 
(AVS, AVSE, AVR, and AO), session (four), and 
run (two per session). There was a significant main 
effect of condition [F(3,9) = 22.07, p < 0.001]. The 
threshold means (in dB SNR) for the four condi- 
tions were AVS = -17.149; AVSE = -17.091; 
AVR = -16.336; and AO = -15.516. There were 
no other significant main effects or interactions. 
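
The size of the audiovisual advantage follows directly from the reported means. The short sketch below subtracts each audiovisual mean from the AO mean; the function and variable names are illustrative only.

```python
# Mean thresholds (dB SNR) reported for Experiment 2.
thresholds = {"AVS": -17.149, "AVSE": -17.091, "AVR": -16.336, "AO": -15.516}

def gain_re_ao(means, reference="AO"):
    """Threshold improvement (dB) of each condition relative to the reference."""
    ref = means[reference]
    return {cond: round(ref - value, 3)
            for cond, value in means.items() if cond != reference}

gains = gain_re_ao(thresholds)
print(gains)
```

The resulting improvements are roughly 1.6 dB for AVS and AVSE and 0.8 dB for AVR.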

Contrast analyses showed that AO thresholds 
were significantly higher than the thresholds in 
the three other conditions (AVS, AVSE, and 
AVR): AO versus AVR [F(1,3) = 11.559, p =
0.042]; AO versus AVS [F(1,3) = 68.568, p =
0.004]; and AO versus AVSE [F(1,3) = 69.739,
p = 0.004]. But there were not any other significant
differences among audiovisual conditions.

Fig. 6. Individual means for thresholds in each condition of
Experiment 2 with the participants' previous AVS (AVS-P)
thresholds. Participants are designated S1-S4. Experiment 2
means shown on the right are across all threshold estimates for
the four participants. Means are dB signal-to-noise ratio at the
estimated 79.4% threshold. Note: "*" designates participant
was in Experiment 1. "**" designates participant was in a pilot
study in which the 71% threshold (a more difficult level) was
tested to obtain the AVS-P threshold. "***" designates
participant was in a pilot study in which the 50% threshold
was obtained, and the speech was presented at 65 dB SPL (yet
more difficult).
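
As a consistency check on the contrasts against AO, the p values can be recomputed from the F statistics. The sketch below relies on the identity F(1, df) = t(df)^2 and on the closed-form cumulative distribution of Student's t with 3 degrees of freedom; the function name is illustrative, and the approach is a check under these textbook identities rather than the analysis software used in the study.

```python
import math

def p_value_f_1_3(f_stat):
    """Two-tailed p for F(1, 3), using F(1, df) = t(df)**2 and the
    closed-form CDF of Student's t with 3 degrees of freedom."""
    x = math.sqrt(f_stat) / math.sqrt(3.0)
    cdf = 0.5 + (x / (1.0 + x * x) + math.atan(x)) / math.pi
    return 2.0 * (1.0 - cdf)

for label, f_stat in [("AO vs AVR", 11.559), ("AO vs AVS", 68.568),
                      ("AO vs AVSE", 69.739)]:
    print(label, round(p_value_f_1_3(f_stat), 3))
```

With only three denominator degrees of freedom, even an F near 11.6 corresponds to a p value close to the 0.05 criterion, which is worth keeping in mind when interpreting the AVR contrast.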

The results in Fig. 6 suggest that not only was 
there no difference between AVS and AVSE in 
Experiment 2, but in addition, the removal of the 
preliminary mouth gesture resulted in higher 
thresholds than these participants had obtained 
with the AVS stimulus previously. When these 
participants had been tested previously, they ob- 
tained lower AVS thresholds, even under more dif- 
ficult testing conditions. That is, although all of 
the participants in Experiment 2 received the same 
adaptive threshold rule, and the previous results 
for Participants 2 and 4 used the same threshold 
rule, the previous results shown in Fig. 6 for Par- 
ticipants 1 and 3 were obtained under more diffi- 
cult conditions: Participant 1 had been tested in 
a pilot experiment in which the 71% adaptive 
threshold rule was used. The previous results for 
Participant 3 were from a pilot experiment in 
which the 50% adaptive threshold rule was used, 
and the speech acoustic stimulus was presented 
at 65 dB SPL.
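
The three threshold levels mentioned here (79.4%, 71%, and 50%) match the convergence points of n-down/1-up adaptive staircases. This section does not name the tracking rule, so the mapping below is an assumption: a staircase that lowers the level after n consecutive correct responses and raises it after any error converges where p^n = 0.5.

```python
def up_down_target(n_down):
    """Proportion correct at which an n-down/1-up staircase converges:
    the level p where n consecutive correct responses (p**n) are as
    likely as not, i.e. p**n = 0.5."""
    return 0.5 ** (1.0 / n_down)

for n_down in (1, 2, 3):
    print(n_down, round(100 * up_down_target(n_down), 1))
```

Under this assumption, the 1-down, 2-down, and 3-down rules converge at 50%, about 70.7%, and about 79.4% correct, respectively. A lower target percentage places the tracked level at a less favorable SNR, consistent with the description of the pilot conditions as more difficult.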

5.1. Discussion

The results of Experiment 2 can be interpreted 
as evidence that the audiovisual effect of Experi- 
ment 1 is replicable. But the additional advantage 
for visual speech appears to be fragile. Participants 
in Experiment 1 apparently took advantage of the 
temporal relationship between the visible prepara- 
tory mouth motions and the consonant gestures, 
which was a reliable temporal cue to the location 
of the upcoming acoustic amplitude peak. This 
advantage seems to have been defeated in Experi- 
ment 2, in which participants' AVS and AVSE 
thresholds were not different from the AVR 
thresholds, and in which their AVS thresholds rose 
relative to previous AVS thresholds. 

During Experiment 2, eight thresholds were ob- 
tained in the AVS and eight in the AVSE condi- 
tions. The order of conditions was randomized 
within sets of the four conditions (i.e., AO, 
AVR, AVS, and AVSE), so that each participant 
completed eight randomized sets. Thus, during a 
set of four thresholds, with randomly ordered conditions,
the preliminary mouth gesture was present
for only one of the threshold runs. A possible
explanation for the results in Experiment 2 is that 
when the preliminary gesture was no longer reli- 
ably available across the context of the experi- 
ment, the participants no longer took advantage 
of it when it was present. That is, they no longer 
attended to the preliminary mouth gesture. On 
the other hand, the advantage of AV stimuli rela- 
tive to the AO condition remained a robust, statis- 
tically significant difference. This result is 
compatible with the hypothesized bottom-up excit- 
atory-excitatory detection mechanism, which to be 
effective should operate independently of stimulus
identity. 
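
The ordering scheme described above, in which conditions were randomized within sets of four, can be sketched as follows. The function name and the seed are hypothetical; the sketch only illustrates that, within any set, the unedited AVS token (and hence the preliminary gesture) occurs in exactly one of the four threshold runs.

```python
import random

CONDITIONS = ["AO", "AVR", "AVS", "AVSE"]

def randomized_sets(n_sets=8, seed=None):
    """Return n_sets lists, each an independent random ordering of the
    four conditions (one threshold run per condition per set)."""
    rng = random.Random(seed)
    return [rng.sample(CONDITIONS, k=len(CONDITIONS)) for _ in range(n_sets)]

schedule = randomized_sets(seed=1)  # seed is arbitrary, for reproducibility
for i, order in enumerate(schedule, start=1):
    print(i, order)
```

Each participant's eight sets therefore contain eight AVS runs interleaved with 24 runs in which the preliminary gesture is absent.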

6. General discussion 

In Experiment 1, the AO detection thresholds 
were significantly higher than the thresholds ob- 
tained with AV stimuli. But the natural AV speech 
token produced significantly lower thresholds than 
those obtained with the synthesized video stimuli 
(AVR, AVSR, and AVL). Importantly, there was 
not any difference among the synthesized video 
stimuli, suggesting that the dynamics of the AVR 
and AVL stimuli did not contribute additional 
advantage beyond the static rectangle. Experiment 
2 compared the AVS stimulus to the AVSE stimu- 
lus for which the preliminary mouth gesture was 
removed, but the total stimulus was equal in dura- 
tion. When the preliminary speech gesture was re- 
moved, both the original AVS and the AVSE 
stimuli produced similar thresholds, but ones that 
were higher than those obtained earlier by these 
participants. This result suggests that the advan- 
tage of the preliminary gesture depended in part 
on the experimental context. 

Taken together, the two experiments do not 
support the theory put forward by Grant (2001)
and Grant and Seitz (2000), that the primary 
mechanism of the audiovisual speech detection 
enhancement effect is perception of the fine- 
grained correlation between lip area and acoustic 
amplitude peaks, and that this correlation is used 
by top-down attentional mechanisms. First, the 
data with the static rectangle in Experiment 1 show 
that fine-grained correlations between the sound 
and the lips are not required for the effect. Second,
the additional speech advantage found in Experi- 
ment 1 can be interpreted as a reduction in stimu- 
lus uncertainty by providing a pre-cue to the 
location of an upcoming acoustic peak, rather 
than an effect of audiovisual correlation. 

Previously, Schwartz et al. (2002) conducted a 
syllable identification experiment at -9 dB SNR,
using babble noise. Performance was extremely 
inaccurate except for the identification of conso- 
nant voicing. This was attributed to making use 
of the temporal cue afforded by preliminary mouth 
gestures. A top-down attentional strategy was 
hypothesized to be the mechanism responsible. In
Schwartz et al. (2003), the initial lip gesture was
specifically investigated and found to improve 
audiovisual speech identification, even though the 
visual information specific to the consonant iden- 
tity was replaced by a video sequence that was 
fixed across syllables. When, in their second exper- 
iment, they replaced the lip gestures with a red bar 
that increased and then decreased in height, no 
audiovisual gain was obtained. Given that theirs 
was an identification experiment and was con- 
ducted at a more favorable SNR, generalization 
across their experiments and ours is hazardous.
Nevertheless, both supported some role for pre-cuing
in enhancing performance, and both showed
that pre-cuing is a relatively fragile effect. Given 
the longer visual (Foxe and Simpson, 2002; Kro- 
lak-Salmon et al., 2001) than auditory (Naatanen, 
2001; Steinschneider et al., 1999; Yvert et al., 2001) 
system processing latencies described earlier, a 
mechanism that helps to initiate auditory attention 
in advance of the acoustic stimulus could provide a 
strategic advantage. 

In fact, none of the results obtained in the current
study support speech-specific mechanisms
in enhancing the auditory speech detection
thresholds. If the preliminary mouth gesture were
a cue that automatically engaged speech-specific
mechanisms, it might be expected that it would 
function whenever present. But that is not what 
was found in Experiment 2. 

All of the visual stimuli (speech and non-speech)
could have participated in early (possibly
sub-cortical, superior colliculus) bottom-up, excitatory-excitatory
neural mechanisms that lead to
response gain under multisensory stimulus conditions,
particularly when a stimulus is at threshold
level (Meredith, 2002). This type of response 
apparently does not require visual stimulus feature 
analysis beyond locating audiovisual correspon- 
dence in time and/or in space (Stein and Meredith, 
1993). 

6.1. Conclusions 

This study examined whether fine-grained 
audiovisual correlations are responsible for the 
AV detection enhancement effect reported by 
Grant (2001) and Grant and Seitz (2000). Here, 
comparisons between speech and non-speech stim- 
uli were used to investigate what is special about 
audiovisual speech processing, and what is more 
likely attributable to more general audiovisual pro- 
cessing capacities. Our results with the simple static 
visual non-speech (non-phonetic) stimulus suggest
that fine-grained correlations are not the basis for 
the effect. Overall, the results did not support the 
hypothesis that the primary mechanism of the AV 
detection enhancement depends on perception of 
brief high positive correlations between lip area 
and acoustic amplitude peaks. The results across 
Experiments 1 and 2 support the conclusion that 
speech can provide a pre-cue that enhances acous- 
tic speech detection, but that the cue use is rela- 
tively fragile. The comparison across experiments 
raises a cautionary note for attributing effects to 
hypothesized mechanisms. Because the only visual 
speech stimulus in Experiment 1 produced a signif- 
icant advantage, the possibility was raised that 
visual speech was somehow special. Results from 
Experiment 2 were not consistent with that possi- 
bility. Consideration of known neurophysiological 
processing constraints informed the design of the 
current study. The results were seen to be consis- 
tent with the possibility that a bottom-up excit- 
atory-excitatory mechanism is responsible for the 
AV speech detection enhancement effect. 